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Abstract —In the near future, massively parallel computing 
systems will be necessary to solve computation intensive ap¬ 
plications. The key bottleneck in massively parallel implemen¬ 
tation of numerical algorithms is the synchronization of data 
across processing elements (PEs) after each iteration, which 
results in significant idle time. Thus, there is a trend towards 
relaxing the synchronization and adopting an asynchronous 
model of computation to reduce idle time. However, it is not 
clear what is the effect of this relaxation on the stability and 
accuracy of the numerical algorithm. 

In this paper we present a new framework to analyze 
such algorithms. We treat the computation in each PE as 
a dynamical system and model the asynchrony as stochastic 
switching. The overall system is then analyzed as a switched 
dynamical system. However, modeling of massively parallel 
numerical algorithms as switched dynamical systems results in 
a very large number of modes, which makes current analysis 
tools available for such systems computationally intractable. We 
develop new techniques that circumvent this scalability issue. 
The framework is presented on a one-dimensional heat equation 
and the proposed analysis framework is verified by solving the 
partial differential equation (PDE) in a nVIDIA Tesla™ GPU 
machine, with asynchronous communication between cores. 

I. Introduction 

Exascale computing systems will soon be available to 
study computation intensive applications such as multi¬ 
physics multi-scale simulations of natural and engineering 
systems. Many scientific and practical problems can be 
described very accurately by ordinary or partial differential 
equations which may be tightly coupled with long-range 
correlations. These exascale systems may have O(10 5 —10 6 ) 
processors ranging from multicore processors to symmetric 
multiprocessors [1]—[3]. Furthermore, such systems are likely 
to be heterogeneous using both heavily multi-threaded CPUs 
as well as GPUs. Many challenges must be overcome before 
exascale systems can be utilized effectively in such appli¬ 
cations. One such obstacle is the communication in tightly 
coupled problems during parallel implementation of any 
iterative numerical algorithm. This communication requires 
massive data movement in turn leading to idle time as the 
cores need to be synchronized after each time step. 

Recent literature has proposed relaxing these synchro¬ 
nization requirements across the PEs [4], This potentially 
eliminates the overhead associated with extreme parallelism 
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and significantly reduces computational time. However, the 
price to pay is loss of predictability possibly resulting in 
calculation errors. Thus, a rigorous analysis of the tradeoff 
between speed and accuracy is critical. This paper present 
a framework for quantifying this tradeoff by analyzing the 
asynchronous numerical algorithm as a switched dynamical 
system [5]—[ 14]. While there is a large literature for analysis 
of such systems, these techniques are not applicable to our 
application. The reason is that due to the large number of 
PEs, the switched system model has an extremely large 
number of modes, which makes the available analysis tools 
intractable. Key contributions in this paper include new 
techniques for a) stability analysis, or quantification of 
steady-state error with respect to the synchronous solution; b) 
convergence rate analysis of the expected value of this error; 
and c) probabilistic bounds on this error. These techniques 
are developed to be computationally efficient, and avoid the 
aforementioned scalability issue. 

The paper is organized as follows. Section II addresses 
the problems for the asynchronous numerical algorithm. In 
section III, we introduce a switched system framework to 
model the system structure for the asynchronous numerical 
scheme. The stability results are presented in section IV, 
and section V shows the convergence rate analysis. Then, 
the error analysis in probability is developed in section VI. 
Section VII demonstrates the usefulness of the proposed 
method by examples. Finally, section VIII concludes this 
paper. 


II. Problem Formulation 


Notation: The symbol || • || and || • Hoc stand for the 
Euclidean and infinity norm, respectively. The set of positive 
integers are denoted by N. Further, No = NU{0}. Also, A(-) 
represents an eigenvalue of a square matrix. In particular, 
^max(') and ^min(') denote the largest and the smallest 
eigenvalue in magnitude, respectively. The symbols (g), det(-), 
tr(-), and vec(-) denote Kronecker product, matrix determi¬ 
nant, trace operator, and vectorization operator, respectively. 
Finally, the symbol Pr(-) stands for the probability. 

In this paper we demonstrate our framework and tech¬ 
niques on the one-dimensional heat equation, given by 


du d 2 u 
dt a dx 2 ’ 


t > 0 , 


( 1 ) 


where u is the time and space-varying state of the tempera¬ 
ture, and t and x are continuous time and space respectively. 
The constant a > 0 is the thermal diffusivity of the given 
material. 



The PDE is solved numerically using the finite difference 
method by Euler explicit scheme, with a forward difference 
in time and a central difference in space. Thus 0 is 
approximated as 

Ui(k + 1 ) - Ui{k) f u i+1 (k) - 2 Ui(k) + u i _ 1 {k)\ 

At ~ “ V Ax 2 ) ’ 

( 2 ) 

where k £ No is the discrete-time index and m is the 
temperature value at i th grid space point. The symbols At 
and Ax denote the sampling time and the grid resolution 
in space, respectively. Further, if we define a constant r = 
a 7 ^ 2 , then Q can be written as 

Ui(k + 1) = ru i+1 (k) + (1 - 2r)ui(k ) + ru i - 1 (k), (3) 

It is important to observe that ([3]) is a discrete-time linear 
dynamical system. 
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Fig. 1. Discretized one-dimensional domain with an asynchronous nu¬ 
merical algorithm, the PE denotes a group of grid points, assigned to each 
core. 


Fig.[T]illustrates the numerical scheme over the discretized 
ID spatial domain. A typical synchronous parallel implemen¬ 
tation of this numerical scheme assigns several of these grid 
points to each PE. The updates for the temperature at the grid 
points assigned to each PE, occur in parallel. However, at 
every time step k, the data associated with the boundary grid 
points, where the communication is necessary are synchro¬ 
nized, and used to compute m(k + 1). This synchronization 
across PEs is slow, especially for massively parallel systems 
(estimates of idle time due to this synchronization give 
figures of up to 80% of the total time taken for the simulation 
as idle time). Recently, an alternative implementation which 
is asynchronous has been proposed. In this implementation, 
the updates in a PE occur without waiting for the other 
PEs to finish and their results to be synchronized. The data 
update across PEs occurs sporadically and independently. 
This asynchrony directly affects the update equation for the 
boundary points, as they depend on the grid points across 
PEs. For these points, the update is performed with the 
most recent available value, typically stored in a buffer. The 
effect of this asynchrony then propagates to other grid points. 
Within a PE, we assume there is no asynchrony and data is 
available in a common memory. 

Thus, the asynchronous numerical scheme corresponding 
to 0 is given by 

Ui(k + 1 ) = ru i+ i{k* +1 ) + (1 - 2r)ui{k) + rui-^k*^), 

(4) 


where k* £ {k,k—l,k — 2,...,k — q + l}, i = 1,2,..., N, 
denotes the randomness caused by communication delays 
between PEs. The subscript i in k* depicts that each grid 
space point may have different time delays. The parameter 
q is the length of a buffer that every core maintains to store 
data transmitted from the other cores. In this paper, we treat 
k* as a random variable and thus 0 can be considered to 
be a linear discrete-time dynamical system with stochastic 
updates. 

Although 0 is derived for the ID heat equation, the 
treatment above can be developed for any parabolic PDEs. 
This observation encourages us to consider using tools from 
dynamical systems to analyze the effect of asynchrony in 
parallel numerical algorithms. Therefore, the primary goal of 
this study is to investigate the stability, convergence rate, and 
error probability of the asynchronous numerical algorithm in 
the framework of stochastic switched dynamical systems. 

III. A Switched System Approach 

Let us define the state vector Uj(k) £ R n = 
[«{(/c), u J 2 (k), ..., u^(fc)] T , where uj(k) stands for the i th 
grid space point in the j th PE and n is the total number of 
grid points in the j th PE. Therefore, ([3]) can be compactly 
written as 


U(k + 1) = AU(k), k £ N 0 , 


where U{k) £ I*" A [U 1 (k) T , U 2 {k) r ,U N (k) T ] T , N 

is the total number of PEs, n is the size of the state for each 
PE, and system matrix A £ ]^ NnxNn j s given by 


ri 


A = 


r 

0 


0 

l-2r 

r 


0 

r 

l-2r 


0 

r 


O' 

0 

0 




0 r l-2r r 

0 • • • 0 0 1 


Note that the first and the last row of A matrix specify 
the Dirichlet boundary conditions (see pp. 150, [15]). i.e., 
we have the constant in time boundary temperatures for 
simplicity. 

Next, we define the augmented state X{k) £ M. Nnq A 
[U(k ) T , U{k — 1 ) T ,... ,U(k — q+ 1) T ] T , where, as stated 
before, q is the buffer length. For pedagogical simplicity (and 
without loss of generality), we consider the case with q = 2 
and N = 3. Further, we let n = 1, which implies there is 
only one grid point in each PE. For this particular case, we 
construct the following matrices. 
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where I £ R NnxNn anc j q ^ NnxNn are t j le identity and 
the zero matrices with appropriate dimensions. As in [4], we 
assume that the condition 0 < r < 0.5 holds from now on. 
The asynchronous numerical scheme can then be written as 
a switched system 

X(k + 1) = W ak X(k), ct* £ {1,2,...,m}, k £ No, (5) 

where the matrices W CTfc £ R-WngxiVn^ the subsystem 
dynamics. In general, the total number of switching modes 
is m = q 2 W~ 2 ) that is obtained by considering all cases 
to distribute every components r in W\ matrix, where the 
number of r in W\ is given by 2(N — 2), into q numbers of 
zero block matrix as in the above example. Therefore, the 
number of modes increase exponentially with the number of 
PEs, which is quite large for massively parallel systems. 

At every time step, the numerical scheme evolves using 
one of the m modes, which depends on the variable k*. In 
this paper, we model the variable k* as a random variable that 
evolves in an independently and identically distributed (i.i.d.) 
fashion in time, and independently from one core to the next. 
Hence, we let tt ? be the modal probability for Wj which is 
assumed to be stationary in time. Let n = {tti,tt 2 , ■ ■ ■, 7r m }, 
be the switching probabilities such that 0 < itj < 1, Vj 
and i n j = 1- The system in 0 is thus an i.i.d jump 
linear system, which is a simpler case of the more well- 
known Markovian jump linear systems [13]. Even though 
the analysis theory for such systems is well developed, the 
existing tools are not suitable for our application because of 
the extremely large number of modes, particularly when N 
is large. Thus, we now develop an analysis theory for the 
i.i.d. jump linear systems which scales better with respect to 
the number of modes. 


IV. Stability 


The first requirement is that of convergence of (|5j. Be¬ 
cause of the Dirichlet boundary conditions, we expect the 
temperature to converge to a constant value for every grid 
point. We proceed to analyze the conditions for conver¬ 
gence (or stability) of the system. To this end, we may try 
to use the infinity norm and apply the sub-multiplicative 
property to obtain || X(k + 1)||oo = \\W ak X{k)\\ 00 < 
||W 0 -J| 00 ||X(fc)|| 00 = ||X(fc)||oo, where the last equality 
holds since we have | W :) \ \^ = 1, Vj. This can be written 
as 


II X(k + 1) Hop 

ll*(*) 11=0 “ ' 


(6) 


The above result only shows that the solution from the 
asynchronous algorithm is marginally stable and we are 
unable to determine the steady-state solution. 

In fact, we can show that the asynchronous scheme also at¬ 
tains the same steady-state value as the synchronous scheme, 
regardless of the specific realization of {<7/-}. Using spectral 
decomposition, the matrices Wj can be expressed in terms 
of the eigenvalues and corresponding eigenvectors as 


Nnq 

Wj £ M^xivn, = £ a 144, J = {1,2,..., m}, (7) 

i—1 


where A \ £ R, v\ £ K JVn 9 xl ; and s? £ U. lxNnq denote the 
eigenvalues, right eigenvectors, and left eigenvectors of Wj, 
respectively. 

Since max|A^| < (| IT j 11 x = 1, Vj, the spectral radius of 

i 

Wj, j = 1,2,, ni, is less than or equal to 1. Therefore, 

we may order the eigenvalues as 1 > |Aj| > |Aj| > ••• > 

IA J Nnq | > 0. It can be shown that all Wj have two eigenvalues 
with value 1, irrespective of the size of q and N. Therefore, 
the eigenvalues for Wj are ordered as 1 = |Aj| = Aj| > 

\X 3 \ > ■ ■ ■ > |A^ ng | > 0 . 

Moreover, the left and right eigenvectors for eigenvalues 
equal to 1 are common eigenvectors for all matrices Wj, 
j = 1,2,..., m. These common left and right eigenvectors 
are 

1) Left eigenvectors: 

sr = [1,0, - - - ,0 , 0 ,... , 0 ] £ R lxNnq , (8) 

s 2 = [0, - - - ,0,1 , 0 , • •• , 0 ] £ R lxJV " 9 , (9) 

2) Right eigenvectors: 

= ,fii] T eR Nn9Xl , ( 10 ) 

v 2 = ■ ■ ' , M2] T € R Ar ™ 9Xl , (11) 


plxNn 


where 0 £ 
elements, and m = [l,^Ef, 
R lxNn, ^ A [0, 


denotes a row vector with all zero 

Nn —2 Nn—j 1 fll r 

’ Nn-1 ’ “ ’ ’ Nn—1 ’ U J ^ 
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R lxNn , j = 1,2 ,...,Nn. 

Notice that we have Wj Vi = v. L and SiWj = Sj, i = 1, 2, 
Vj. Then, the steady-state value for the asynchronous scheme 
is given by the following result. 

Proposition 4.1: Consider the i.i.d. jump linear system 
in 0 with subsystem matrices Wj, j = 1,2 ,...,to and 
a stationary switching probability W For a given initial 
condition AT(0), if we define = uisi + V 2 S 2 , where u, 
and Si, i = 1,2, are given in then, the steady-state 

value X ss has the following form: 


= lim X(k) = l'X(O), 

k—> 00 

irrespective of the switching sequence {oa}. 

Proof: Let the eigenvalues of Wj be ordered in mag¬ 
nitude by 1 = |Aj| = |A 2 | > |A|| > ■ •• > \X J Nnq \ > 0. Also, 
let vf and s\ be the right and left eigenvector corresponding 
to Aj, respectively. Using the spectral decomposition, Wj can 
be alternatively expressed by Wj = Y^t=i K v i s i = T 1 + 
Xx/i where T = uiSi + v 2 s 2 and f j (i) = A' rjsf 
Then, starting with X(0), the realization of the switching 
sequence cr*, results in 


X{k) = • • • W CT1 W CT0 X(0) 

= (vp + y, n - 1 «) •■■(*+ E f ao (o)- Y (o) 

U fc -Vi VVI 

= (* k + g(k))x( 0), 

where in above equation, g(k) represents all the other multi¬ 
plication terms except T /,: term. Note that <j(k) is formed 










by the product of Aj, where 0 < | A] < 1, Vi > 2, 
Mj. Consequently, if k —> oo, then g(k) is asymptotically 
convergent to zero since the infinite number of multiplication 
of the term Aj, Vi > 2, converges to zero. Therefore, we have 

X ss = lim X(k) = lim tf fc X(0) = T'X(O). 

k—foo k—f oo 

The last equality in above equation holds because = 

'T A; ~ 1 = •■■ = $, Vfc e n. ■ 


V. Convergence rate 


In this section, we investigate how fast the expected value 
of the state converges to the steady-state X ss by analyzing 
the transient behavior of the asynchronous algorithm. Let 
us define a new state variable e(k) = X(k) — X ss . The 
expected value of e(k) is given by e(k) = E[X(fc) — A ss ] = 
E[X(k)) - X ss = X{k) - X ss , where X{k) = E[X(fc)]. 
Therefore, the convergence rate of ||e(fc)| | will provide bound 
for the convergence rate of ||A(fc) — X ss | |. 

To obtain an upper bound for the convergence rate of 
||e(fc)||, we use the following matrix transformation. As 
described in 0, each modal matrix Wj can be alterna¬ 
tively expressed by Wj = Y^i=i K v l s i’ where Aj, vj, 
and sj denote the eigenvalues, right and, respectively, left 
eigenvectors for Wj. If we define the transformed matrix 


Wj = Wj - Y,\l=iK v i s l =Wj-\t = V. v; / , Aj rjsj. 

then the modal dynamics with the corresponding state ej(k), 
is given by 


ej(k + 1) = Wjej(k), j = {l,2,...,m},feeN 0 . (12) 

Moreover, as in 0, the error state e(k) = X(k) — X ss , is 
governed by 


e(k + 1) = W ak e{k), cr fe e {1,2,... ,m}, k G N 0 . (13) 


The system in ( fl3]l is also a switched linear system. The 
transformed matrix W :l are the modes of the error dynamics. 
Generally, it is difficult to estimate the convergence rate of 
the ensemble with stochastic jumps. Previous works [16]— 
[19] have used the common Lyapunov function approaches, 
to analyze stability and the convergence rate. However, the 
existence of a common Lyapunov function is the only suffi¬ 
cient condition for the system stability, and hence there may 
not exist a common Lyapunov function for the asynchronous 
algorithm. Moreover, extremely large values of m make it 
very difficult to test every conditions for the existence of such 
a common Lyapunov function. For this reason, we bound the 
convergence rate of e(fc), instead of bounding e(k) directly. 

Lemma 5.1: Consider an i.i.d. jump linear system 
given by C3 with the switching probability n = 
{7Ti, 7T2 ,..., 7r m }. If the initial state e(0) is given and has 
no uncertainty, the expected value of e{k) is updated by 

e(k ) = E {e(k)\ = A fc e(0) or e(k + 1) = A e(k), (14) 

m 

where A = ttiWj. 

i=l 


Proof: For an i.i.d. jump process with a given deter¬ 
ministic initial error e(0), we have 

E[e(k)\=E[W ak _ 1 e(k-l)\ 

= E[W ak _ 1 W ak _ 2 ...W„ 1 W ao e(0)] 

= E[WVJ ... E[W ai ] E[lC ff0 ] e(0) = A fc e(0). 

=A —A —A 


Since the matrix A is given by A = YlnLi tttWi, the 
computation of A requires all matrices Wj, j = 1, 2,..., m. 
As pointed out earlier, this calculation is intractable due 
to the extremely large number of the switching modes m. 
Therefore, instead of using ( fl4] i, we provide a computation¬ 
ally efficient method to bound ||e(fc)|| through a Lyapunov 
theorem. 

Consider a discrete-time Lyapunov function V(k) = 
e(fc) T Pe(k), where P is a positive definite matrix. Since 
it is shown that the original state X(k) is convergent to the 
unique steady-state X ss as k —> oo irrespective of {cr*}, 
the expected error e(k) = X(k) — X ss is asymptotically 
stable. Therefore, one can employ the Converse Lyapunov 
Theorem [20], which guarantees the existence of a positive 
definite matrix P, satisfying the following linear matrix 
inequality (LMI) condition A T PA — P < —Q, where Q is 
some positive definite matrix. The matrix inequality can be 
interpreted in the sense of positive definiteness, (i.e., A > B 
means the matrix A — B is positive definite.) Then, the above 
LMI condition results in A V(k) = V(k + 1) — V(k) = 
e(fc) T (A T PA - P)e(fc) < -e(fc) T Qe(fc) < -A min (Q) 
|| e(k) || 2 . Also, the Lyapunov function V(k) satisfies 


A min(P) || e(k) II 2 < V(k) < A max (P) II e(k) 


resulting in — || e(k) || 2 < — 


V(k) 

^max (■ p ) 


. Therefore, we have 


A V(k) < -A min (Q) || e(k) || 2 < - 


^min (Q) 


v(k). 


^max (-^) 

^min (Q) 

^max (P) 

Hence, || e(k) || is bounded by a following equation: 


V{k + 1) < (l 


)v(k). (15) 


e(k) || 2 < K 1- 


^min (Q) 

^max (P) 


m 


(16) 


where K > 0 is some constant. 

Next, we bound the convergence rate for ||e(fc)|| by using 
the result in (fl6|) as follows. 


Proposition 5.1: For a stable i.i.d. jump linear system 
G3 with a stationary switching probability n, consider 
a Lyapunov candidate function for the state e, given by 
V = e 1 Pc. where P is a positive definite matrix. In addition, 
a Lyapunov candidate function for ( | 1 2\ is given by Vj = 
ejPj&j, j = 1,2where Pj is a positive definite 
matrix. According to the Converse Lyapunov Theorem, there 
exist Pj > 0 and P > 0 such that WjPjWj — Pj < 
-Qj, j = 1,2 ,... ,m and A T PA — P < — Q, where Qj and 







Q are any positive definite matrices. Then, with a particular 
choice of these matrices, we assume that Pj and P satisfy 
the following conditions: 


WjPjWj-Pj = j = 1, 2,..., m, (17) 


A T PA - P < —Sjl, 

A max (P) 

W) 

m 

( [T2| >, and A = Hj Wj. 

3=1 

Then, ||e(fc)|| 2 is bounded by 


for some j, 


(18) 


where Sj = x ' > 0, Wj are the modal matrices in 

'Xmax j ) 


e{k ) || 2 < K ( 1- 


A max (Pj) 

where A > 0 is some constant. 


1 


s(0) II 2 , 


(19) 


Proof: By applying the result in (fl6|) into (|T8|), we have 


e(k) || 2 < K 1- 


= K |^1 - 
= K (1 - 


Amm(V?-0 

A max (p) 

£ j 

^max (P) 

l 

^max (Pj) 


e(0) 


e(0) 


e(0) 


The last equality in above equation holds by the definition 

Of £j. ■ 

Proposition ED says that we can always guarantee the 
bound for ||e(fc)|| if ( fT8f ) holds. Consequently, the existence 
of such a P, satisfying ( fTS) is the major concern in order 
to guarantee the bound ||e(fe)||. The following lemma and 
theorem can be used to prove the existence of such a P. 

Lemma 5.2: Suppose that Pj is a positive definite matrix, 
satisfying Then, the largest eigenvalue of Pj is strictly 
greater than 1 for all j, i.e., A m ax(Pj) > 1, Vj. 

Proof: From ( | 1 7[ ), Pj = WjPjWj + 1, Vj. Then, with 
the eigenvectors y £ M. Nnq of Pj, the largest eigenvalue of 
Pj is given by its definition as follows: 


A max(Pj) — A max(wj PjWj + I) 

= max y T {Wj PjWj + I)y 
\\v\\ 2 =i 

= max (y T WjPjWjy) + y T y 
\\y\\ 2 =i =\\y\\ 2 =i 


Since Pj is a positive definite matrix, Wj PjW j becomes 
a positive semi-definite matrix at least. Then, the scalar 
term y T WjPjWjy cannot be zero unless WjPjWj is 
a zero matrix or a triangular matrix with zero diagonal 
components, which is not the case. Hence, it is guaranteed 
that y T Wj PjWjy > 0, implying A m ax(Pj) > 1, Vj. ■ 
Theorem 5.1: Consider Lyapunov functions for © and 
m given by V 3 = ej Pjej, j = 1,2and V = 
e T Pe, respectively, where the matrices Pj > 0, Vj and P > 
0. By the Converse Lyapunov Theorem, we assume that the 
matrices Pj, Mj, satisfies the condition (fl7]>. 


Then, there exists a positive definite matrix P such that 


A t PA — P < — £jl , for some j , (20) 


where Ej = 


(P)_ 

(Pj) 


> 0 . 


Proof: We prove by contradiction. Suppose that there 
exist no such P > 0, satisfying ( [20] ), which is equivalent to 
that for all matrices P > 0, the inequality A 1 PA — P > 
—£jl holds Vj. The above inequality can be interpreted in 
the quadratic sense. In other words, for any non-zero vector 
v that has a proper dimension, the following condition holds: 


v T (A t PA - P + Ejl) v > 0, Vj (21) 


As a particular choice of v, we let the vector v be the 
eigenvector of the matrix A, i.e., Av = AA, where A is the 
eigenvalue of A. Since ( |2Tj ) holds for any matrix P > 0, we 

let P = I, which results in e 3 = ^ max (-0 _ 


Hence, we have 


^max (Pj) ^max (Pj)' 


0 < v T A t A -/ + 


1 

^max (Pj) 


= (jVuJ — 

=Xv —Xv 


i 


= A 2 - 1 


1 

A max (Pj) 


^max (Pj)' 
kll 2 , Vj. 


From the structure of the matrix A, it can be shown that 
det(A) = 0. Ther efore , one of the eigenvalues A is zero. 

< 1, Vj. As a 


5.2 


states that 


Moreover, Lemma 
consequence, with A V 0, we have 

1 


0 < -1 + 


A max (Pj) 


A max (Pj) 

Ml 2 < o, 

>o 


Vj- 


<o 


which is a contradiction. ■ 

Remark 5.1: Proposition |5 .1 1 provides a very efficient way 
to bound the convergence rate for ||e(fc)||. According to the 
proposed methods, it is unnecessary to compute the matrix 
A and to keep all matrices Wj, j = 1,2since 
|| e(k) 11 is bounded by the proposed Lyapunov function. Also, 
Theorem [5T] guarantees the condition ( p~8j ), which is assumed 
in Proposition |5.1| 

Note that we specify the modal matrix W m in 0 as the 
most delayed case - all PEs use the oldest value in the buffer. 
Therefore, it can be inferred that A max (Pm) > A maa; (P 7 j, 
Vj, which results in 

||e(fc)|| 2 < A'(l - 1 J '||e(0)|| 2 , (22) 

where K is a positive constant. Therefore, the only infor¬ 
mation required to compute the convergence rate of ||e(fc)||, 
is the matrix W m with the corresponding positive definite 
matrix P m . As a result, the rate of convergence can be 
calculated by the proposed methods without any scalability 
problems. 




















VI. Error Analysis 

In this section, we investigate the error probability, which 
quantifies the deviation of the random vector X{k) from its 
steady-state value X ss in probability. To measure this error 
probability, the Markov inequality given by Pr(X >e) < 
EX 

——where X is a nonnegative random variable and e is 
a positive constant, is used. First of all, we investigate the 
term vec (e(fc)e(fc) T ) as follows: 

vec (e{k)e{k) T ) = vec {w aki e{k - l)e(fc - l) r Wj ki ) 

= (W4 fc _i <S> W CTfc _ 1 )vec(e(fc - l)e(fc - 1) T ). (23) 

In the second equality of above equation, we used the 
property that vec {ABC) = ( C T 0 A)vec(B). 

By taking the expectation with new definitions y(k) = 
vec (e(fc)e(fc) T ), y{k) = E[j/(fc)], and T ak = W ah 0 W ak , 
m becomes 

m = nym=®[r« k - 1 y(k-i)] 


= E E 

r—1 


Fa^yik-l) 


o-fc -1 = r 


Pr(<7/c-i = r) 


= E ’ttve iy( k ~ !)] > 


resulting in y{k) = (EEi 7r r r r ) y{k — 1), where in the 
second line we applied the law of total probability and 
the last equality holds by Pr(ak-i = r) = n r for i.i.d. 
switching. 


By the e xactl y same argument given in Lemma 5.1 and 
Proposition 5.1 the upper bound for y(k) is obtained as 
follows: 


llw(*)ll <X 1- 


^max (P m ) 


k /2 


lly(o)||, VfceN, (24) 


where K is some positive constant and P m is a positive 
definite matrix, satisfying the condition T^P m r m — P m = 
—I. However, unlike the positive definite matrix P m £ 
^NnqxNnq j n ( p~ 7 ^ the dimension of the matrix P m is given 
by P, n £ Kl A,r *9r x ( Ar "9) 2 i which may be large in size, and 
hence incurs computational intractabilities to obtain such a 
P m . Therefore, we introduce the following proposition and 
theorem in order to further facilitate the computation of 
A max(Pm) as follows. 

Proposition 6.1: Consider a positive definite matrix P m , 
satisfying the condition r^P m r m — P m — —I, where T m = 
W'm 0 W m . and W rn is any real square matrix. If we assume 
that there exist finite, positive constants ko, cq, and C\ such 
that 

1 < ||W£|| 4 < Co, for k £ [0, ko), (25) 

IIW^II 4 < ci <1, for k £ [fc 0 ,oo), (26) 

then, the largest eigenvalue of P m is bounded by the follow¬ 
ing function: 


A ma ,(P m )<^||lP^|| 4 <A: 0 Cof T 


k —0 


1 


- Cl 


(27) 


Proof: The leftmost inequality in ( [27] ) can be proved 
as follows. The positive definite matrix P m satisfying the 
condition I’, 1 ,, P m T m — P m = —I, is analytically computed 

by Pm = EZo ( r m fc ) 1 (r*,) = Er=o r m k r k m- Then, for 
a given matrix T m = W m 0 W m , we have 

r„ fe r^ < p(rl k r L m )i = P (r k m T r. k ji = a 2 max (r k m )i 
= ll r ™l| 2 T = \\{Wm 0 w m ) k \\ 2 l = \\W^\\ 4 I, (28) 

where p(-) and a max (-) denote the spectral radius and 
the spectral norm, respectively. For equalit y conditions in 

( |28) l, we used the known property that \J P (T!f T T!f) = 
<w( T k J = IIT^H and \\(W m ®W m ) k \\ = ||w*0W*|| = 
l|b^m|| 2 , Vfc £ No- By summing up from k = 0 to oo, 
and then taking the largest eigenvalue in ( |28| , we have 

A max(Pm) = Xma X < E^O ll^l| 4 - 

For the rightmost inequality in ( [27] ), the assumptions in 
result in 
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Hence, we have E lEml | 4 < ^oco ( - 

V 1 “ ci 


k -0 


Theorem 6.1: Consider a stable, i.i.d. jump linear system 
with subsystem dynamics Wj given in ( fl3| . Then, the prob¬ 
ability of ||e(fc)|| 2 > e, where e is some positive constant, is 
given by 


Pr 


|e(fc)||" > e ) < min(l,/3), k £ N 0 , 

k/2 


(29) 


112/(0)11, K > 0 is 


where 0 4 LA ( l _ LlA 
e k 0 co / 

a constant, co,Ci,fco are positive constants such that the 
conditions ([25l)-(l26l» are satisfied. 

















Proof: At first, we consider the following equality 
condition given by 

||e(fc)|| 2 = e{k) T e(k) = tr(e(A:) T e(fc)) = tr(/ (e(fc)e(fc) T )) 
= vec(/) T vec(e(fc)e(fc) T ) = vec (J) T y(fc), (30) 

where we used the cyclic permutation property for the trace 
operator in the first line and the equality in the second line 
holds by the property tr(X T F) = vec(X) T vec(Y) for any 
square matrix X,Y £ R nx ". 

We take the expectation in both sides of ( [30] ), which leads 
to 

E [||e(/c)|j 2 ] = vec(/) T E [?/(&)] = vec (J) T y(fc). (31) 
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Since the term E[||e(fc)|| 2 ] is a scalar value, taking the 
Euclidean norm returns the same value. Hence, applying the 
Euclidean norm in ( |3T] > results in 

E [l|e(fc)|| 2 ] = ||vec(J) T j/(fc)|| 

< l|vec(I) T || • \\y(k)\\ =Vn- ||j/(fc)||- (32) 
Now, plugging ( [24| and ( [27] ) into ( [32] ) leads to 

/ 1 - \ fc/2 

E 0|e(A:)|| 2 ] <VnKi 1-^-1) ||y(0)||. 

Finally, by applying the Markov inequality the above equa¬ 
tion ends up with 


Fig. 2. The spatio-temporal change of the temperature. Initially, the 
temperature was given by the cosine square function. The total grid points 
are 100, and the simulation was terminated when k = 10000. 
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Since the probability cannot exceed one, we have 


Pr(||e(fc )|| 2 > ej < min(l,/3) ■ 

Theorem |6.1| represents the error probability for a given 
bound e. Since e(k) is a time-varying variable, the probability 
Pr (||e(fc )|| 2 > e) also changes with respect to time. Starting 
from a given initial condition y(0), this probability will 

converge to zero if [ 1- - —— | < 1. 

\ koco J 

VII. Simulations 


In order to test the proposed methods, simulation was 
carried out for the one-dimensional heat equation. We 
implemented the asynchronous parallel algorithm with 
CUDA C++ programming on nVIDIA Tesla™ C2050 GPU, 
which has 448 CUDA cores. The simulations were performed 
with the following parameters: 

• Simulation Parameters: 

At 

Ax = 0.1, At = 0.01, a = 0.5, r = a— —^=0.5 

Ax i 

I.C.: Ui =cos 2 ( ^^ i = l,2,...,N 

B.C. : u\(k) = 1, uw(k) = 0, VA: 


• Buffer length: q = 3 


Fig. 3. The results for the stability and convergence rate, (a) The solid 
lines represent the ensembles of total 300 simulations. The synchronous 
case is given by dashed line. The steady-state is depicted by starred line, 
(b) The solid and dotted lines represent 300 ensembles for ||e(/c)|| and the 
normed empirical mean ||e(/c)||, respectively. The dashed line shows the 
upper bound of ||e(/c)|| from the proposed Lyapunov function, respectively. 


• Number of PEs: N = 100. 

• Number of grid points in PE: n = 1 

For a given initial temperature, the spatio-temporal evolu¬ 
tion of the state is presented in Fig. [2] As time k increases, 
the curved shape of the temperature, given as a cosine square 
function initially, flattens out. This simulation represents the 
synchronous case. 

In Fig. [3] (a), the ensemble of the trajectories is shown 
for the asynchronous algorithm. The solid lines show the 
trajectories of total 300 simulations. Due to the randomness 
in the asynchronous algorithm, the trajectories differ from 
each other. For a reference, the synchronous scheme is also 
shown by a dashed line. Although it seems that the syn¬ 
chronous scheme converges faster with respect to the given 
iteration step, the physical simulation time may take more 
because the idle time is necessary at each iteration in the 
synchronous case. As the proposed method guarantees the 
stability through the common eigenvectors, both synchronous 
and asynchronous trajectories converged to the same steady- 
state value X ss , depicted by starred line. 

Next, we present the result for the convergence rate of 
the asynchronous algorithm. We assume that the switching 
probability n has the form of an i.i.d. jump process. Fig. [3] 
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Fig. 4. Error probability with respect to time (a), (b) and with respect to e 
(c), (d). The solid line and dashed line represent empirical error probability 
and empirical Markov inequality, respectively. The cross symbol denotes 
the upper bound for the error probability by the proposed method. 


(b) shows the convergence rate of ||e(fc)||, which describes 
how fast the expected value of the state converges to X ss . 
The solid lines are 300 sample trajectories of ||e(fc)||, starting 
from the given initial condition: e(0) = X(0) — X ss . The 
dotted line depicts the time history of the normed empirical 
mean | |e(fc)11, whereas the dashed line shows an upper 
bound by the proposed Lyapunov method ( [22] ). Note that 
|| e(k) 11 is a random variable, and hence the normed empirical 
mean ||e(fc)|| was obtained by averaging the data over 300 
simulations. In the proposed method, however, it is not 
necessary to execute the simulation multiple times. 

Fig. [4] represents the result for the error probability with 
respect to time and e. For different values of e, Fig. |4] (a) and 
(b) describe the time history of the error probabilities. The 
solid line denotes the empirical probability obtained from 
data - i.e., the number of samples satisfying ||e(fc)|| 2 > e 
divided by the total number of samples. The dashed line 

depicts the Markov inequality, computed from ‘ , 

where E[||e(fc)|| 2 ] is obtained by the statistics. Finally, 
the cross symbols mean the upper bound by the proposed 
method. As shown in Fig. [4] (a) and (b), the probabilities for 
all cases converge to zero since the error is asymptotically 
convergent. 

On the other hands. Fig. [4] (c), (d) show the error proba¬ 
bility with respect to e at fixed time instance. In this result, 
the time is fixed at k = 9000 out of total 10000 iteration 
times, and the probability is computed while increasing e 
values. In Fig. |4] (c) and (d), eT is given by the index along 
i-axis, where the value of T is given in Fig. [4] (c) and (d), 
respectively. In both cases, the error probabilities decrease 
as e increases. 

Although the proposed methods provide a conservative 
bound, it does not require executing the code multiple times 


to predict the convergence rate or the error probability. In 
addition to that, the proposed methods are carried out in a 
computationally efficient manner without storing all subsys¬ 
tem matrices. In this example, we have m = 3 2 ( 10 °— 2 ) ~ 
3 200 , and keeping 3 200 numbers of matrices is intractable 
in the real implementation. The proposed method, however, 
guarantees the convergence rate and the error probability, 
without any scalability issues. Therefore, the presented meth¬ 
ods provide a computationally efficient tool to analyze the 
asynchronous numerical schemes. 

VIII. Conclusions 

This paper studied the stability, convergence rate, and error 
probability of the asynchronous parallel numerical algorithm. 
The asynchronous algorithm achieves better performance 
in terms of the total simulation time, particularly when 
massively parallel computing is required because it doesn’t 
wait for synchronization across PEs. In order to analyze the 
asynchronous numerical algorithm, we adopted the switched 
linear system framework. Although modeling of massively 
parallel numerical algorithms as switched dynamical systems 
results in a very large number of modes, we developed new 
methods that circumvent this scalability issue. While the 
results presented here are based on ID heat equation, the 
analysis approach is generic and be applicable to other PDEs 
as well. 
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